Jun 20, 2019, foss4gnl, Delft edzer.github.io/foss4gnl/ (rmd)

What is R?

r-project.org: “R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.”

  • high-level language, meant for interacting with data
  • implements a programming language (called S)
  • extensible through a packaging system
  • emphasises reproducibility: over time, and across platforms
  • written in C, interfaces to Fortran, C, C++, Java, Python, …

Why do people use R?

I honestly don’t know! But maybe:

  • to get stuff done, relatively quickly
  • to get everything done in one place (=interoperability)
  • to do things reproducibly (= scripted)
  • to be able to easily share & communicate what you did (Mac/Windows/Linux)
  • to become part of a data science community
  • but … Python? See e.g. Norm Matloff’s R vs Python for Data Science
  • doing stuff reproducibly is more important than whether you use R or Python, and the two languages integrate well
  • competition encourages ecosystem evolution

R: history

  • S was created by John Chambers in 1978, then at Bell Labs
  • (S-Plus was commercial, AT&T, Lucent, Insightful, Tibco)
  • R was a research project, started in 1992 by Ross Ihaka and Robert Gentleman at the Univ. of Auckland, to re-implement S in Scheme
  • R 1.0.0 released in 2000
  • Most academic S-Plus contributors moved to R early 2000s
  • (now, Tibco sells TERR, an alternatively licensed runtime compatible to R)

R Spatial: history

  • pre-2003: many spatial packages existed in S-Plus, including spatstat, spdep, maptools, gstat
  • 2003: R Spatial workshop set up by Roger Bivand, agreed on a need for a common set of classes and methods
  • 2005: sp released, used S4, rgdal (Tim Keitt/RB; links to GDAL, PROJ) followed
  • 2009: rgeos interface to GEOS (Colin Rundel/RB, GSOC)
  • 2010: raster for scalable raster analysis (Robert Hijmans)
  • 2016: sf, using S3, implements simple features and links to the tidyverse
  • 2018: stars for raster and vector data cubes (spatiotemporal, array data)

2017: (illustration (c) @allison_horst)

Simple Feature Access

Package sf “implements” simple feature access (SFA, OGC 06-103r4), but this leaves a lot open: e.g. how to handle

  • geometry generating functions, e.g. st_union on two tables (pairs of sets)
  • operations using geodetic (long/lat) coordinates: what are X/Y/Z?
  • the (self-inflicted) axis order mess
  • how do attributes relate to geometries, and how does this affect queries involving geometry and attributes?

sf has attribute-geometry relationships

attribute-geometry relationship values:

  • constant: value is valid everywhere throughout the geometry
  • aggregate: value is an aggregation over the geometry
  • identity: value is constant and unique

library(sf)
## Linking to GEOS 3.7.0, GDAL 2.4.0, PROJ 5.2.0

nc = read_sf(system.file("gpkg/nc.gpkg", package="sf"))
pt = st_sfc(st_point(c(-81.18642, 36.15643)), crs = st_crs(nc))
st_intersection(nc["SID74"], pt) 
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
## Simple feature collection with 1 feature and 1 field
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -81.18642 ymin: 36.15643 xmax: -81.18642 ymax: 36.15643
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
## # A tibble: 1 x 2
##   SID74                 geom
##   <dbl>          <POINT [°]>
## 1     4 (-81.18642 36.15643)
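The warning above appears because the attribute-geometry relationship defaults to "unknown". A minimal sketch of declaring it explicitly with st_set_agr() — setting SID74 to "constant" here is for illustration only (a count is really an aggregate):

```r
library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
pt = st_sfc(st_point(c(-81.18642, 36.15643)), crs = st_crs(nc))
st_agr(nc)["SID74"]              # NA: relationship unknown, hence the warning
nc = st_set_agr(nc, "constant")  # declare all attributes spatially constant
st_intersection(nc["SID74"], pt) # no "assumed constant" warning anymore
```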

What are data cubes?

  • They are annotated array data.
  • Arrays map from \(n\) dimensions to \(p\) attributes: \[(D_1,D_2,...,D_n)\rightarrow (A_1,A_2,...,A_p)\]
  • array dimensions are enumerated sets, and may
    • represent a set of entities (cities, persons, species, spectral bands)
    • discretize a continuous variable, such as
      • spatial dimension,
      • time,
      • time-of-day,
      • frequency or wavelength,
      • duration (e.g., time-to-forecast)
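The dimensions-to-attributes mapping can be sketched with a plain R array (the dimension names are a made-up example):

```r
# a 3-D array: dimensions (x, y, time) -> one attribute (e.g. temperature)
a = array(seq_len(4 * 3 * 2), dim = c(4, 3, 2),
          dimnames = list(x = NULL, y = NULL, time = c("t1", "t2")))
dim(a)        # 4 3 2
a[2, 1, ]     # one (x, y) location, all time steps: 2 14
```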

“Cube”:

We use the word cube as shorthand for:

  • “real” cube: three dimensions
  • more than three dimensions: hypercube
  • two dimensions: matrix, raster

a special case is:

  • one dimensional cube: table with records (data.frame, tibble)

“Spatial” data cube

Cube dimensions can refer to spatial dimensions in several ways:

  • each continuous spatial dimension (x, y, z) maps to a single cube dimension (raster data cubes), e.g. regular grids
    • \(c(i) = o + (i-1) \times \delta, \ \ i = 1,2,...,n\)
    • index space is continuous: integer \(i\) implies \([i,i+1)\)
    • this means that every coordinate value maps to a unique index (unlike polygons)
  • all spatial dimensions map to a single cube dimension (vector data cubes)
    • cube dimension is a set of points/lines/polygons
  • combinations of these: e.g. origin-destination matrices
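The regular-grid mapping above can be sketched in a few lines, using the x offset/delta of the Landsat example shown later in this deck (o = 288776, delta = 28.5; the helper names are hypothetical):

```r
o = 288776; delta = 28.5                  # x dimension: offset and cell size
coord = function(i) o + (i - 1) * delta   # c(i): edge of cell i, which covers [c(i), c(i+1))
coord(1:3)                                # 288776.0 288804.5 288833.0
cell = function(x) floor((x - o) / delta) + 1  # inverse: every coordinate maps to one cell
cell(288790)                              # 1: still inside the first 28.5 m cell
```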

Raster - vector conversion

From raster to vector:

  • polygons or points are given:
    • sample (“extract”)
    • aggregate, e.g. mean or sum over polygon
  • polygons or lines are the result:
    • polygonize
    • contour

From vector to raster:

  • (points/polygons:) rasterize
  • (point sample:) interpolate
  • (point pattern:) density
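A hedged sketch of both directions with stars/sf (the buffered point is a made-up location inside the example Landsat scene):

```r
library(stars)
library(sf)
x = read_stars(system.file("tif/L7_ETMs.tif", package = "stars"))[,,,1]  # band 1
poly = st_buffer(st_sfc(st_point(c(293000, 9115000)), crs = st_crs(x)), 1000)
# raster -> vector: aggregate (mean) of the cell values over the polygon
aggregate(x, poly, mean)
# vector -> raster: burn the polygon into a grid covering the scene's extent
st_rasterize(st_sf(value = 1, geometry = poly), st_as_stars(st_bbox(x)))
```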

Raster types

R package stars

  • a stars object is a set (list) of arrays with possibly varying type (numeric, integer, factor, logical, character, list)
  • uses R’s native arrays for that: all array Ops work (pixel math)
  • supports arrays with measurement units and time stamps (through GDAL, libnetcdf)
  • has a dimensions attribute “table”
  • does everything in memory, unless you use stars_proxy, which does nothing in memory
  • lets you define an arbitrary number of dimensions
  • slice & dice with [, or filter
  • map/reduce with st_apply: apply a function to a (set of) dimension(s)
  • aggregate for spatial / temporal aggregations
  • supports rectilinear (all dims), rotated, sheared, and curvilinear grids (raster), and simple features (vector)

R package stars (2)

  • implements raster (GDAL, netcdf) and vector (sfc) data cubes
  • full support for PROJ coordinate reference systems
  • time support: POSIXct, Date, as well as PCICt (360, 365, noleap)
  • integration with some tidyverse verbs (filter, select, mutate, geom_stars)
  • integrates with gstat (spatial, spatiotemporal geostatistics)

library(stars)
tif = system.file("tif/L7_ETMs.tif", package = "stars")
(x = read_stars(tif))
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##   L7_ETMs.tif    
##  Min.   :  1.00  
##  1st Qu.: 54.00  
##  Median : 69.00  
##  Mean   : 68.91  
##  3rd Qu.: 86.00  
##  Max.   :255.00  
## dimension(s):
##      from  to  offset delta                       refsys point values    
## x       1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL [x]
## y       1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL [y]
## band    1   6      NA    NA                           NA    NA   NULL

plot(x)

library(stars)
library(ggplot2)
ggplot() + geom_stars(data = x) + coord_equal() + facet_wrap(~band) +
  theme_void()  +  scale_x_discrete(expand=c(0,0)) + scale_y_discrete(expand=c(0,0))

stars_proxy objects

  • contain pointers to the data (file, URL), which by no means uniquely defines an array
  • use a strategy to go through them (chunking: currently only spatial)
  • are lazy: only compute when data is requested
  • can read downsampled arrays (e.g. to plot): optimises execution order of call stack
  • can time-compose e.g. a time stack of NetCDF files, with a time slice in each file

granule = system.file("sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip", package = "starsdata")
base_name = strsplit(basename(granule), ".zip")[[1]]
s2 = paste0("SENTINEL2_L1C:/vsizip/", granule, "/", base_name, ".SAFE/MTD_MSIL1C.xml:10m:EPSG_32632")
(p = read_stars(s2, proxy = TRUE))
## stars_proxy object with 1 attribute in file:
## $`MTD_MSIL1C.xml:10m:EPSG_32632`
## [1] "SENTINEL2_L1C:/vsizip//home/edzer/R/x86_64-pc-linux-gnu-library/3.6/starsdata/sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.SAFE/MTD_MSIL1C.xml:10m:EPSG_32632"
## 
## dimension(s):
##      from    to offset delta                       refsys point values    
## x       1 10980  3e+05    10 +proj=utm +zone=32 +datum...    NA   NULL [x]
## y       1 10980  6e+06   -10 +proj=utm +zone=32 +datum...    NA   NULL [y]
## band    1     4     NA    NA                           NA    NA   NULL
object.size(p)
## 7336 bytes

system.time(plot(p))

##    user  system elapsed 
##   0.767   0.165   0.420

ndvi = function(x) (x[4]-x[3])/(x[4]+x[3]) # 10m bands: band 4 = NIR, band 3 = red
system.time(plot(st_apply(p, c("x", "y"), ndvi)))

##    user  system elapsed 
##   1.067   0.140   0.705

OSGEO - R

This annoys me:

wgs84 = "+proj=longlat +datum=WGS84"
st_linestring(rbind(c(-179,50), c(179,50))) %>%
    st_sfc(crs = wgs84) %>%
    st_bbox
## xmin ymin xmax ymax 
## -179   50  179   50
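One workaround, sketched here under the assumption that the segment is meant to cross the antimeridian: sf::st_shift_longitude() recodes longitudes into [0, 360), which makes the bounding box tight again:

```r
library(sf)
l = st_sfc(st_linestring(rbind(c(-179, 50), c(179, 50))),
           crs = "+proj=longlat +datum=WGS84")
st_bbox(st_shift_longitude(l))   # xmin 179, xmax 181 (-179 mapped to 181)
```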

and this too:

list(rbind(c(0,66), c(-120,66), c(120,66), c(0,66))) %>%
    st_polygon() %>% st_sfc(crs = wgs84) -> polar_circle
st_point(c(0,90)) %>% st_sfc(crs = wgs84) -> north_pole
st_intersects(polar_circle, north_pole)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## Sparse geometry binary predicate list of length 1, where the predicate was `intersects'
##  1: (empty)

R global

  • is an upcoming activity, to handle spherical data, well, as spherical data.
  • will have to reconsider the bounding box concept
  • will help plotting e.g. orthographic maps
  • will use Google’s s2geometry.io library, which also powers G Maps, G Earth, G Earth Engine and G Bigquery GIS

OSGEO, R, open software standards, culture

  • Nobody in the entire R community talks about (open) standards.
  • Being open doesn’t make a standard good:
  • Good standards follow (or fix) open source implementations (geo: WMS, SOS, GPKG, SFA). So that it works, and the world gets better.
  • Bad standards don’t follow open source implementations, but introduce complexity and/or software or platform lock-in
  • OGC should only produce good standards, but can’t, because its organisational model (closed, membership levels) allows strong partners to push for software or platform lock-ins: plutocracy
  • OGC also evolves standards, which breaks things, and introduces complexity (axis order: anyone?)
  • When will OGC start abandoning standards, and simplify others?

What to do?

  • Keep away from the bad standards, and create good (software) alternatives
  • Don’t accept that OGC plays your game: look at STAC, WFS-3, openEO, CityJSON, …
  • Innovate through open source implementations, not by starting to write a standards document!

Concluding

  • R has a rich, evolving ecosystem of packages for spatial data science, largely reusing OSGEO components
  • healthy competition: quality follows from (re-)use
  • Data cubes include vector data cubes, and are not constrained to raster/imagery data
  • Get involved, on github.com/r-spatial or r-spatial.org
  • Spatial Data Science with R is an upcoming book, written online: r-spatial.org/book
  • Remember that as opposed to paper documents, only software actually works, and
  • Innovate by writing software, don’t look back, and
  • Always challenge those who try to tell you what your foundations are
  • We need to bring up the discussion about what standards lock-in means

edzer.github.io/foss4gnl/

@edzerpebesma